feat(retrieval): PageIndex-style page-based agentic strategy (PR-B) by hallelx2 · Pull Request #25 · hallelx2/vectorless-engine

hallelx2 · 2026-05-27T16:42:01Z

Why

FinanceBench's debt-registration question scores 0/1 on our current section-based retrieval against a 508-node 10-K outline — the "pick a section_id" surface is too noisy. PageIndex hits 98.7% on the same benchmark with a smaller interface: 3 tools, page-range navigation, no embeddings.

This PR ports that interface to vectorless-engine as a new strategy + dedicated answer endpoint. The existing endpoints are unchanged; PageIndex is an opt-in, additive surface.

What ships

1. PageIndexStrategy (pkg/retrieval/pageindex_strategy.go)

A new Strategy + CostStrategy implementing a faithful port of PageIndex's three-tool reasoning loop:

get_document_structure() — returns the TOC tree as JSON (titles + page ranges, no body text).
get_pages(start_page, end_page) — returns the concatenated content of every section whose [PageStart, PageEnd] overlaps the requested range, clipped at PageContentLimit.
done(answer, cited_pages, reasoning) — terminates with the natural-language answer and the inclusive page ranges the answer relies on.
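As a rough illustration of the get_pages contract above, the sketch below selects every section whose inclusive page range overlaps the request and clips the concatenated text. The Section type, the getPages helper, and the newline separator are hypothetical stand-ins, not the code in pkg/retrieval/pageindex_strategy.go, and the clip here counts bytes for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a hypothetical stand-in for a node in the section tree.
type Section struct {
	ID        string
	PageStart int
	PageEnd   int
	Content   string
}

// overlaps reports whether the inclusive range [s.PageStart, s.PageEnd]
// shares at least one page with the inclusive range [start, end].
func overlaps(s Section, start, end int) bool {
	return s.PageStart <= end && start <= s.PageEnd
}

// getPages concatenates the content of every overlapping section in input
// order and clips the result to limit bytes (assumption: a byte clip; a
// real implementation may prefer to clip on a rune boundary).
func getPages(sections []Section, start, end, limit int) (string, []string) {
	var b strings.Builder
	var ids []string
	for _, s := range sections {
		if !overlaps(s, start, end) {
			continue
		}
		ids = append(ids, s.ID)
		b.WriteString(s.Content)
		b.WriteString("\n")
	}
	out := b.String()
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out, ids
}

func main() {
	sections := []Section{
		{ID: "s1", PageStart: 1, PageEnd: 3, Content: "Business overview"},
		{ID: "s2", PageStart: 3, PageEnd: 5, Content: "Risk factors"},
		{ID: "s3", PageStart: 6, PageEnd: 9, Content: "Debt securities"},
	}
	text, ids := getPages(sections, 4, 6, 16000)
	fmt.Println(ids)
	fmt.Print(text)
}
```

A request for pages 4 to 6 touches s2 and s3 but not s1, because s1 ends on page 3.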

The system prompt is a port of the reference PageIndex demo (PageIndex/examples/agentic_vectorless_rag_demo.py:44-52) adapted to vle's JSON-action protocol (llmgate v0.2.0's Tools field is still scaffolding-only). When llmgate wires native tool calling, the action surface is unchanged.
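To make the JSON-action protocol concrete, here is a minimal sketch of decoding one model turn into a tool call. The Action struct, its JSON field names, and the [[start, end]] shape for cited_pages are assumptions for illustration; only the three tool names and the "tool" vs "action" key tolerance come from this PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Action is a hypothetical wire shape for a single model turn.
type Action struct {
	Tool       string   `json:"tool"`
	Action     string   `json:"action"` // tolerated alias for "tool"
	StartPage  int      `json:"start_page"`
	EndPage    int      `json:"end_page"`
	Answer     string   `json:"answer"`
	CitedPages [][2]int `json:"cited_pages"` // assumed [[start, end], ...]
	Reasoning  string   `json:"reasoning"`
}

// parseAction decodes raw model output and checks it names a known tool.
func parseAction(raw string) (Action, error) {
	var a Action
	if err := json.Unmarshal([]byte(raw), &a); err != nil {
		return Action{}, fmt.Errorf("bad action JSON: %w", err)
	}
	if a.Tool == "" {
		a.Tool = a.Action // accept {"action": ...} as well as {"tool": ...}
	}
	switch a.Tool {
	case "get_document_structure", "done":
		// No page arguments to validate here.
	case "get_pages":
		if a.StartPage < 1 || a.EndPage < a.StartPage {
			return Action{}, errors.New("get_pages: invalid page range")
		}
	default:
		return Action{}, fmt.Errorf("unknown tool %q", a.Tool)
	}
	return a, nil
}

func main() {
	turns := []string{
		`{"tool":"get_document_structure"}`,
		`{"action":"get_pages","start_page":5,"end_page":7}`,
		`{"tool":"done","answer":"example answer","cited_pages":[[5,7]]}`,
	}
	for _, raw := range turns {
		a, err := parseAction(raw)
		fmt.Println(a.Tool, err)
	}
}
```

Because each turn is plain JSON text, the same three actions could later map onto native tool calls without changing what callers see.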

Graceful degradation: the strategy uses a TOCProvider interface for get_document_structure observations. When the persisted documents.toc_tree column is NULL (pre-PR-A state), the provider's ErrNoTOC signal triggers a synthesised view derived from the section tree. Until PR-A merges, every request takes this fallback path, and the strategy works without a persisted TOC.

Result.Reasoning carries the agent's final answer (/v1/answer/pageindex reads it directly). Result.SelectedIDs is the union of every section whose page range overlaps any cited range, so the existing /v1/query callers still get a section list. A new Result.PagesRead []PageReadEntry records every get_pages call (start/end/section_ids/char_count) for cost debugging and the reasoning trace.

2. POST /v1/answer/pageindex (internal/api/pageindex.go)

One round-trip: retrieval + answer + citations come back from a single agentic loop. No separate synthesis call — the model writes its answer inside the done action.
Trace token: the strategy's computePageIndexTraceToken hashes doc_id || "pageindex:"+model || sorted cited page ranges, folding the strategy name into the model position so page-based and section-based tokens never collide. Stored in the existing replay store; /v1/replay returns byte-identical responses.
Per-page-range citations with answer-span quotes pulled via the existing SpanExtractor over the concatenated cited content (offsets back into that content).
reasoning_trace (opt-in via body reasoning:true or ?reasoning=true) lists every tool call with hop/tool/args/result_chars/sections_touched. Captured via a new OnEvent hook on PageIndexStrategy.
Streaming (stream:true) via Server-Sent Events. One event per tool call so callers watch the navigation in real time, terminated by an answer event carrying the full payload.
Per-request overrides for max_hops and max_pages_per_fetch without mutating shared Deps.
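The trace-token property described above (stable across citation order, distinct from section-based tokens) can be sketched like this. SHA-256, the NUL separator, and the "p:5-7" key format are assumptions; the actual computePageIndexTraceToken reuses ComputeTraceToken, whose internals are not shown in this PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// PageRange is an inclusive cited page range.
type PageRange struct{ Start, End int }

// traceToken hashes the document ID, the strategy-prefixed model name, and
// the sorted cited ranges. Sorting makes the token independent of the
// order in which the model listed its citations.
func traceToken(docID, model string, cited []PageRange) string {
	keys := make([]string, 0, len(cited))
	for _, r := range cited {
		// The "p:" tag keeps page ranges distinct from section IDs.
		keys = append(keys, fmt.Sprintf("p:%d-%d", r.Start, r.End))
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, part := range []string{docID, "pageindex:" + model, strings.Join(keys, ",")} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so adjacent fields cannot run together
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := traceToken("doc_123", "model-x", []PageRange{{5, 7}, {1, 2}})
	b := traceToken("doc_123", "model-x", []PageRange{{1, 2}, {5, 7}})
	fmt.Println(a == b) // same ranges in a different order
}
```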

3. Config (pkg/config/config.go)

New RetrievalConfig.PageIndex block: enabled (default true), max_hops (8), page_content_limit (16000), model (inherit).
VLE_RETRIEVAL_PAGEINDEX_* env overrides (Enabled/MaxHops/PageContentLimit/Model).
Validate() accepts pageindex as a strategy name and rejects negative knobs.
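A minimal sketch of how the defaults, env overrides, and validation above could fit together. The struct, helper names, and the exact env variable suffixes (ENABLED, MAX_HOPS, PAGE_CONTENT_LIMIT, MODEL) are assumptions; the PR only states the VLE_RETRIEVAL_PAGEINDEX_* prefix and the four knobs.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// PageIndexConfig is a hypothetical mirror of the PageIndex config block.
type PageIndexConfig struct {
	Enabled          bool
	MaxHops          int
	PageContentLimit int
	Model            string // empty means inherit the default model
}

// defaults returns the values listed in the PR description.
func defaults() PageIndexConfig {
	return PageIndexConfig{Enabled: true, MaxHops: 8, PageContentLimit: 16000}
}

// applyEnv overlays env overrides; lookup has os.LookupEnv's signature so
// tests can pass a fake. Unparseable values are rejected, not ignored.
func applyEnv(c *PageIndexConfig, lookup func(string) (string, bool)) error {
	const prefix = "VLE_RETRIEVAL_PAGEINDEX_" // suffixes below are assumed
	if v, ok := lookup(prefix + "ENABLED"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return fmt.Errorf("%sENABLED: %w", prefix, err)
		}
		c.Enabled = b
	}
	ints := map[string]*int{"MAX_HOPS": &c.MaxHops, "PAGE_CONTENT_LIMIT": &c.PageContentLimit}
	for name, dst := range ints {
		v, ok := lookup(prefix + name)
		if !ok {
			continue
		}
		n, err := strconv.Atoi(v)
		if err != nil {
			return fmt.Errorf("%s%s: %w", prefix, name, err)
		}
		*dst = n
	}
	if v, ok := lookup(prefix + "MODEL"); ok {
		c.Model = v
	}
	return nil
}

// Validate rejects negative knobs.
func (c PageIndexConfig) Validate() error {
	if c.MaxHops < 0 || c.PageContentLimit < 0 {
		return fmt.Errorf("pageindex: max_hops and page_content_limit must be non-negative")
	}
	return nil
}

func main() {
	cfg := defaults()
	if err := applyEnv(&cfg, os.LookupEnv); err != nil {
		fmt.Println("config error:", err)
		return
	}
	if err := cfg.Validate(); err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```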

4. Wiring (cmd/engine/main.go)

buildStrategy registers pageindex as a selection strategy choice.
When retrieval.pageindex.enabled is true, a dedicated PageIndexStrategy instance is wired into api.Deps.PageIndexStrategy regardless of which strategy is selected as the default, so a deployment running chunked-tree for /v1/query still gets /v1/answer/pageindex.

5. OpenAPI + config.example.yaml

Full spec for the new endpoint: PageIndexAnswerRequest, PageIndexAnswerResponse, PageIndexCitation, PageReadEntry, PageIndexTraceEntry. Both application/json and text/event-stream content types under 200, with SSE event type documentation. Example config block with operator-readable comments.

Test plan

pkg/retrieval/pageindex_strategy_test.go — 15 unit tests: canonical 3-tool sequence, multi-range citations, MaxHops force-done (with and without recovery), TOC fallback (and persisted-TOC precedence), persistent bad JSON, out-of-range + partial-overlap page clamping, empty tree, loader-less degradation, content clipping, empty-citations refusal, trace-token stability + order invariance, parser tolerance.
internal/api/pageindex_test.go — 12 end-to-end handler tests via httptest with a mock LLM, mock storage, and a PageIndexTreeLoader test seam: happy path, reasoning trace (body + query param), bad request, document not found, disabled (config + nil strategy), no LLM, replay persistence verifying byte-equal response bytes, SSE event stream shape, per-request override caps the loop, TOC fallback.
pkg/config/config_test.go — 5 config tests: defaults, env overrides (all four knobs), enable-toggle from disabled, garbage env rejection, validation negatives.
Full go test ./... and go build ./... clean.
config.example.yaml parses cleanly via config.Load.
Existing tests unchanged.

Risk envelope

Opt-in at the request level. Existing endpoints (/v1/query, /v1/answer, /v1/replay) are unchanged. The new /v1/answer/pageindex is purely additive.
Works without PR-A. The strategy falls back to a synthesised TOC view when documents.toc_tree is NULL. Even if PR-A is never merged, this PR delivers value.
Test coverage gates merge. 32 new tests; existing tests still pass.

Out of scope (NOT in this PR)

TOC tree builder (pkg/tree/tree.go TOCNode + ingest stage). PR-A owns that. The TOCProvider interface is the integration point — when PR-A lands, the engine wires a DB-backed implementation reading documents.toc_tree.

Add a new retrieval Strategy modelled on PageIndex's 3-tool reasoning protocol (get_document_structure, get_pages, done). The model navigates by inclusive page range rather than by section ID, a tighter interface for paginated documents (SEC filings, academic PDFs) where the prior "pick a section ID from a 500-node outline" surface was too noisy.

The loop:

- get_document_structure() returns the document's TOC as JSON (titles + page ranges, no body text). Wires to a TOCProvider that reads documents.toc_tree when present; falls back to a synthesised view derived from the section tree when not, so the strategy works even before the TOC-builder PR lands.
- get_pages(start_page, end_page) returns concatenated content of every section whose [PageStart, PageEnd] overlaps the requested range, clipped to PageContentLimit chars.
- done(answer, cited_pages, reasoning) terminates with the final answer + the page ranges the answer relies on.

SelectWithCost surfaces both the agent's literal answer string (via Result.Reasoning) and the set of section IDs whose page range overlaps any cited range (via Result.SelectedIDs), so the existing /v1/query + /v1/answer callers can consume the strategy without changes. A new PagesRead field on Result captures every get_pages call (start/end/section IDs/char count) for cost debugging and the reasoning-trace surface.

Protocol uses the same JSON-action text shape AgenticStrategy proved (llmgate v0.2.0's Tools field is still scaffolding-only); when llmgate wires native tool calling the surface here is unchanged. The parser tolerates "tool" vs "action" keys, a "5-7"-string Pages alternative, and string-shaped cited_pages.

Trace-token reuses ComputeTraceToken but folds the strategy name into the model position so page-based and section-based runs on the same doc/model don't collide, and tags the page ranges with "p:" so they share namespace with section IDs without colliding.

15 unit tests cover: the happy 3-tool sequence, multi-range citations, MaxHops force-done (both with and without recovery), TOC fallback, persisted-TOC precedence, persistent bad JSON, out-of-range and partial-overlap page clamping, empty tree, loader-less degradation, content clipping, empty-citations refusal, trace-token stability + order invariance, and parser tolerance for every documented input shape.

Wire the PageIndex strategy through a dedicated answer endpoint on the existing /v1 router. The endpoint:

- Owns the full RAG round-trip in one request: retrieval + answer + citations come back from a single agentic loop. No separate synthesis call: the model emits its answer inside the done action and we surface it as `answer` on the response.
- Emits page-grounded citations. One citation per page range the agent fetched (deduplicated), each carrying start_page / end_page / section_ids plus an answer-span quote pulled via the existing SpanExtractor over the cited content. Falls back gracefully when the LLM declines a quote.
- Persists every successful response to the existing replay store under the strategy's deterministic trace_token. The token's input set is sorted cited page ranges (not section IDs), and the strategy name is folded into the hash so page-based and section-based tokens for the same doc/model never collide.
- Supports an opt-in reasoning trace (body field `reasoning:true` or query param `?reasoning=true`) that surfaces per-hop tool calls + args + tool-result chars + sections touched, captured via a new OnEvent hook on PageIndexStrategy.
- Streams via Server-Sent Events when `stream:true` is set on the body. One event per tool call (get_document_structure, get_pages, done) so callers watch the navigation in real time, terminated by an `answer` event carrying the full JSON response payload.
- Honors per-request overrides for max_hops and max_pages_per_fetch without mutating shared Deps.

Disabled deployments (retrieval.pageindex.enabled=false or no LLM client) return 501; missing documents 404; bad bodies 400.

Adds `RetrievalConfig.PageIndex` (PageIndexBlock) with defaults (Enabled=true, MaxHops=8, PageContentLimit=16000) and matching VLE_RETRIEVAL_PAGEINDEX_* env overrides. Validation rejects negative knobs and accepts "pageindex" as a retrieval strategy.

cmd/engine/main.go registers the strategy via buildStrategy when retrieval.strategy=pageindex, and wires a standalone PageIndexStrategy instance into the api.Deps used by the answer endpoint, so the endpoint is available regardless of which selection strategy the deployment runs by default.

Test coverage: 12 end-to-end handler tests (happy path, reasoning trace via body field + query param, bad request, not found, disabled in two modes, no LLM, replay persistence verifying byte-equal response bytes, SSE event stream shape, per-request override caps the loop, TOC fallback). Plus 5 config tests for defaults + env overrides + validation. A PageIndexTreeLoader function field on Deps acts as a test seam so handler tests can run end-to-end via httptest with an in-memory tree, without a real Postgres backend.

OpenAPI 3.1 spec for the new endpoint:

- POST /v1/answer/pageindex documented with the PageIndexAnswerRequest body shape (document_id, query, optional model, max_hops, max_pages_per_fetch, stream, reasoning) and PageIndexAnswerResponse (answer, citations, hops_taken, usage, trace_token, pages_read, reasoning_trace).
- PageIndexCitation, PageReadEntry, and PageIndexTraceEntry component schemas describe the page-grounded citation shape, the per-call navigation footprint, and per-hop reasoning trace entries.
- The 200 response carries content for both application/json (non-streaming) and text/event-stream (when stream:true) with documentation of the SSE event types: `started`, one event per tool call (get_document_structure / get_pages / done), and a terminal `answer` event carrying the full payload.
- 501 covers both "no LLM client" and "retrieval.pageindex.enabled=false" so operators looking at the spec see the toggle that disables the endpoint.
- QueryResponse's strategy enum gains "pageindex" so /v1/query responses returned by a pageindex-default deployment validate against the schema.
- ?reasoning=true query parameter is documented as an alternative to the body's reasoning field.

config.example.yaml:

- retrieval.strategy comment lists every available strategy with a one-line description of each, so an operator picking a strategy can see what they're choosing between without reading code.
- New retrieval.pageindex block with enabled / max_hops / page_content_limit / model knobs, default values matching the engine defaults, and a comment block explaining the three-tool loop, the trace_token / reasoning_trace / streaming differentiators, and the graceful-degradation behaviour when no TOC tree is persisted yet (the synthesised view fallback).
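A client-side sketch of the request body, assuming the field names listed for PageIndexAnswerRequest. The Go types, the omitempty choices, the newAnswerRequest helper, and the localhost base URL are assumptions; openapi.yaml remains the authoritative schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// PageIndexAnswerRequest mirrors the documented body fields; the Go types
// are assumed (ints for the caps, bools for the toggles).
type PageIndexAnswerRequest struct {
	DocumentID       string `json:"document_id"`
	Query            string `json:"query"`
	Model            string `json:"model,omitempty"`
	MaxHops          int    `json:"max_hops,omitempty"`
	MaxPagesPerFetch int    `json:"max_pages_per_fetch,omitempty"`
	Stream           bool   `json:"stream,omitempty"`
	Reasoning        bool   `json:"reasoning,omitempty"`
}

// newAnswerRequest builds (but does not send) the POST request.
func newAnswerRequest(baseURL string, body PageIndexAnswerRequest) (*http.Request, []byte, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/answer/pageindex", bytes.NewReader(payload))
	if err != nil {
		return nil, nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, payload, nil
}

func main() {
	req, payload, err := newAnswerRequest("http://localhost:8080", PageIndexAnswerRequest{
		DocumentID: "doc_123",
		Query:      "Which debt securities are registered?",
		MaxHops:    4,
		Reasoning:  true,
	})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(string(payload))
}
```

Sending the request with http.DefaultClient.Do(req) would return the JSON response, or an event stream when Stream is set.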

sourcery-ai

Sorry @hallelx2, you have reached your weekly rate limit of 500000 diff characters.

coderabbitai · 2026-05-27T16:42:09Z

Warning

Review limit reached

@hallelx2, we couldn't start this review because you've reached your PR review rate limit.


📥 Commits

Reviewing files that changed from the base of the PR and between 28ffc33 and 432524c.

📒 Files selected for processing (11)

cmd/engine/main.go
config.example.yaml
internal/api/pageindex.go
internal/api/pageindex_test.go
internal/api/server.go
openapi.yaml
pkg/config/config.go
pkg/config/config_test.go
pkg/retrieval/pageindex_strategy.go
pkg/retrieval/pageindex_strategy_test.go
pkg/retrieval/strategy.go


Copilot

Pull request overview

This PR adds an opt-in PageIndex-style page-range retrieval/answering path alongside the existing section-based retrieval APIs. It introduces a new page-based agentic strategy, a dedicated /v1/answer/pageindex endpoint, config/wiring, tests, and OpenAPI documentation.

Changes:

Added PageIndexStrategy with JSON tool-call loop, page reads, trace token support, TOC fallback, and tests.
Added /v1/answer/pageindex handler with JSON/SSE responses, reasoning trace, citations, and replay integration.
Added PageIndex config, engine wiring, OpenAPI schemas, and example configuration.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.


File	Description
`pkg/retrieval/strategy.go`	Adds `PagesRead` metadata to retrieval results.
`pkg/retrieval/pageindex_strategy.go`	Implements the new PageIndex page-based strategy.
`pkg/retrieval/pageindex_strategy_test.go`	Adds unit coverage for strategy behavior and parsing.
`pkg/config/config.go`	Adds PageIndex config defaults, env overrides, and validation.
`pkg/config/config_test.go`	Adds config tests for PageIndex settings.
`openapi.yaml`	Documents the new endpoint and schemas.
`internal/api/server.go`	Wires the new route and API dependencies.
`internal/api/pageindex.go`	Implements the PageIndex answer endpoint and SSE path.
`internal/api/pageindex_test.go`	Adds handler tests for JSON/SSE/replay/error paths.
`config.example.yaml`	Documents PageIndex configuration.
`cmd/engine/main.go`	Wires PageIndex as a selectable strategy and dedicated endpoint strategy.


+	// Build a citation per UNIQUE page range present in PagesRead.
+	// The set of pages the model "read" is a superset of what it
+	// cited — some get_pages calls don't end up in the final
+	// cited_pages list — but the union is the right cone of trust
+	// to surface as evidence. The trace token is computed over
+	// only the strictly-cited ranges, which the strategy already
+	// has, so citation drift doesn't break replay.
+	seen := make(map[[2]int]struct{}, len(res.PagesRead))
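The de-duplication the comment above describes can be sketched in isolation as below. PageReadEntry's Go field names and the uniqueRanges helper are assumptions based on the start/end/section_ids/char_count fields named in this PR; the real buildPageIndexCitations also extracts quotes, which is omitted here.

```go
package main

import "fmt"

// PageReadEntry is a hypothetical mirror of one Result.PagesRead record.
type PageReadEntry struct {
	StartPage  int
	EndPage    int
	SectionIDs []string
	CharCount  int
}

// uniqueRanges keeps the first read of each (start, end) pair, preserving
// the order in which the agent fetched them.
func uniqueRanges(reads []PageReadEntry) []PageReadEntry {
	seen := make(map[[2]int]struct{}, len(reads))
	out := make([]PageReadEntry, 0, len(reads))
	for _, r := range reads {
		key := [2]int{r.StartPage, r.EndPage}
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	reads := []PageReadEntry{
		{StartPage: 5, EndPage: 7, SectionIDs: []string{"s2"}, CharCount: 1200},
		{StartPage: 40, EndPage: 42, SectionIDs: []string{"s9"}, CharCount: 900},
		{StartPage: 5, EndPage: 7, SectionIDs: []string{"s2"}, CharCount: 1200},
	}
	for _, r := range uniqueRanges(reads) {
		fmt.Printf("pages %d-%d sections %v\n", r.StartPage, r.EndPage, r.SectionIDs)
	}
}
```

A fixed-size array key such as [2]int is comparable in Go, which is what lets it serve as a map key without string formatting.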


+	citations := d.buildPageIndexCitations(r.Context(), t, res, body.Query, body.Model)
+	final := map[string]any{
+		"document_id": body.DocumentID,
+		"query":       body.Query,
+		"answer":      res.Reasoning,
+		"citations":   citations,
+		"strategy":    strat.Name(),
+		"model":       budget.ModelName,
+		"hops_taken":  res.HopsTaken,
+		"usage": map[string]any{
+			"input_tokens":  res.Usage.InputTokens,
+			"output_tokens": res.Usage.OutputTokens,
+			"total_tokens":  res.Usage.TotalTokens,
+			"cost_usd":      res.Usage.CostUSD,
+			"llm_calls":     res.Usage.LLMCalls,
+		},
+		"elapsed_ms":  time.Since(started).Milliseconds(),
+		"trace_token": res.TraceToken,
+		"pages_read":  res.PagesRead,
+	}
+	emitSSE("answer", final)


+	resp := map[string]any{
+		"document_id": body.DocumentID,
+		"query":       body.Query,
+		"answer":      res.Reasoning, // strategy stores the agent's answer here
+		"citations":   citations,
+		"strategy":    perReq.Name(),
+		"model":       budget.ModelName,
+		"hops_taken":  res.HopsTaken,
+		"usage": map[string]any{
+			"input_tokens":  res.Usage.InputTokens,
+			"output_tokens": res.Usage.OutputTokens,
+			"total_tokens":  res.Usage.TotalTokens,
+			"cost_usd":      res.Usage.CostUSD,
+			"llm_calls":     res.Usage.LLMCalls,
+		},
+		"elapsed_ms":  time.Since(started).Milliseconds(),
+		"trace_token": res.TraceToken,
+		"pages_read":  res.PagesRead,


+	if s.TOC != nil {
+		raw, err := s.TOC.GetTOC(ctx, t.DocumentID)
+		if err == nil && len(raw) > 0 {
+			return string(raw)
+		}
+		// Log and degrade — the strategy must keep going.
+		if err != nil {
+			log.Printf("retrieval: pageindex TOC fetch failed (degrading to synthesised view): %v", err)
+		}


hallelx2 added 3 commits May 27, 2026 17:21

Copilot AI review requested due to automatic review settings May 27, 2026 16:42

sourcery-ai Bot reviewed May 27, 2026

Copilot started reviewing on behalf of hallelx2 May 27, 2026 16:42 View session

hallelx2 merged commit e183ca7 into main May 27, 2026
5 of 9 checks passed

hallelx2 deleted the feat/pageindex-strategy branch May 27, 2026 16:44

Copilot AI reviewed May 27, 2026
